In this notebook, we will perform an analysis using observational data from the 1985 Current Population Survey (CPS) from the US.

The goal of this analysis is to estimate the effect of education on wages, taking into account potential confounding variables.

The dataset contains information on various demographic, economic, and labor-related variables.

CPS = read.csv("CPS1985.csv")

Exploratory Analysis

head(CPS, 10)
##       X  wage education experience age ethnicity region gender occupation
## 1     1  5.10         8         21  35  hispanic  other female     worker
## 2  1100  4.95         9         42  57      cauc  other female     worker
## 3     2  6.67        12          1  19      cauc  other   male     worker
## 4     3  4.00        12          4  22      cauc  other   male     worker
## 5     4  7.50        12         17  35      cauc  other   male     worker
## 6     5 13.07        13          9  28      cauc  other   male     worker
## 7     6  4.45        10         27  43      cauc  south   male     worker
## 8     7 19.47        12          9  27      cauc  other   male     worker
## 9     8 13.28        16         11  33      cauc  other   male     worker
## 10    9  8.75        12          9  27      cauc  other   male     worker
##           sector union married
## 1  manufacturing    no     yes
## 2  manufacturing    no     yes
## 3  manufacturing    no      no
## 4          other    no      no
## 5          other    no     yes
## 6          other   yes      no
## 7          other    no      no
## 8          other    no      no
## 9  manufacturing    no     yes
## 10         other    no      no

The dataset includes 12 variables that capture individual social and economic characteristics. These variables can be categorized as follows:

Numerical Variables: wage (hourly wage in dollars), education (years of education), experience (years of work experience), and age (in years).

Categorical Variables: ethnicity, region, gender, occupation, sector, union membership, and marital status. (The remaining column, X, is a row identifier.)

print(colSums(is.na(CPS)))
##          X       wage  education experience        age  ethnicity     region 
##          0          0          0          0          0          0          0 
##     gender occupation     sector      union    married 
##          0          0          0          0          0

There are no missing values across the variables, so we can proceed with the analysis.

Let’s see the distribution of quantitative variables.

numerical_vars <- c("wage", "education", "experience", "age")
summary(CPS[numerical_vars])
##       wage          education       experience         age       
##  Min.   : 1.000   Min.   : 2.00   Min.   : 0.00   Min.   :18.00  
##  1st Qu.: 5.250   1st Qu.:12.00   1st Qu.: 8.00   1st Qu.:28.00  
##  Median : 7.780   Median :12.00   Median :15.00   Median :35.00  
##  Mean   : 9.024   Mean   :13.02   Mean   :17.82   Mean   :36.83  
##  3rd Qu.:11.250   3rd Qu.:15.00   3rd Qu.:26.00   3rd Qu.:44.00  
##  Max.   :44.500   Max.   :18.00   Max.   :55.00   Max.   :64.00
print(paste("Sample size:", nrow(CPS)))
## [1] "Sample size: 534"

We can observe that the dataset includes 534 adults aged 18 to 64, with 2 to 18 years of education and 0 to 55 years of work experience, so the variability among the sampled individuals is quite large. The same applies to wages, which range from 1 to 44.5 dollars per hour.

library(ggplot2)
library(tidyverse)
## Warning: package 'forcats' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# install.packages("ggiraph")
library(ggiraph)
## Warning: package 'ggiraph' was built under R version 4.4.1
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.1
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
CPS_long <- CPS %>%
  pivot_longer(cols = c(wage, education, experience, age), names_to = "variable_name", values_to = "value")

quantiles <- CPS_long %>%
  group_by(variable_name) %>%
  summarise(
    Q1 = quantile(value, 0.25),
    Median = quantile(value, 0.50),
    Q3 = quantile(value, 0.75)
  )

density_plots <- ggplot(data = CPS_long, aes(x = value)) +
  geom_density(fill = "steelblue", color = "black", alpha = 0.7) +
  facet_wrap(~ variable_name, scales = "free") +
  labs(title = "Distribution of Numerical Variables", x = NULL, y = "Density") +
  theme_minimal() +
  geom_vline(data = quantiles, aes(xintercept = Q1), color = "blue", linetype = "dashed", linewidth = 0.5) +
  geom_vline(data = quantiles, aes(xintercept = Median), color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(data = quantiles, aes(xintercept = Q3), color = "blue", linetype = "dashed", linewidth = 0.5)
ggplotly(density_plots, width = 800, height = 600)

As we can see from the plots, the distributions of wage, experience, and age are right-skewed, meaning that most individuals have lower values of these variables. The distribution of education is more evenly spread, with a median of about 12 years.
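The visual impression of skewness can be checked numerically. Below is a minimal base-R sketch of the moment coefficient of skewness (positive values indicate a right-skewed distribution); in practice it would be applied to columns such as CPS$wage.

```r
# Moment coefficient of skewness: E[(x - mean)^3] / sd^3 (population version)
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# A right-skewed toy vector gives a positive coefficient;
# on the real data one would call, e.g., moment_skewness(CPS$wage)
moment_skewness(c(1, 1, 1, 10))
```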

Since the variables are skewed, there might be some outliers. Let’s have a look at the boxplots to see if there are any.

boxplots <- ggplot(CPS_long, aes(x = variable_name, y = value)) +
  geom_boxplot(fill = "steelblue", color = "black", alpha = 0.7) +
  facet_wrap(~ variable_name, scales = "free", ncol = 4) +
  labs(title = "Distribution of Numerical Variables", x = NULL, y = "Value") +
  theme_minimal()

ggplotly(boxplots, width = 800, height = 600)

There are visible outliers in the data, most notably a number of very high hourly wages above the upper whisker of the wage boxplot (recall that the maximum of 44.5 is far above the third quartile of 11.25).
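The points flagged by geom_boxplot() follow the usual 1.5 × IQR rule; a small base-R helper (a sketch, to be applied to columns such as CPS$wage) makes the criterion explicit.

```r
# TRUE for values beyond the boxplot whiskers,
# i.e. outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# e.g. sum(iqr_outliers(CPS$wage)) counts the high-wage points visible in the plot
```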

Correlations between wages, education, experience, and age

Let’s now observe the correlation between the numerical variables of interest.

numerical_vars_df <- CPS[numerical_vars]

correlation_matrix <- cor(numerical_vars_df)

correlation_matrix_long <- as.data.frame(correlation_matrix) %>%
  rownames_to_column(var = "Var1") %>%
  gather(key = "Var2", value = "value", -Var1) %>%
  filter(Var1 != Var2) %>%
  filter(!duplicated(paste(pmin(Var1, Var2), pmax(Var1, Var2))))


print(correlation_matrix_long)
##         Var1       Var2       value
## 1  education       wage  0.38192207
## 2 experience       wage  0.08705953
## 3        age       wage  0.17696688
## 4 experience  education -0.35267645
## 5        age  education -0.15001947
## 6        age experience  0.97796125

The dependent variable of interest, wage, is positively correlated with the other three numerical variables, with the strongest correlation for education (~0.4). Surprisingly, years of experience and age are not strongly correlated with wage.

The strongest positive correlation is observed between age and work experience, which is not surprising if we assume that almost all individuals were working throughout their adult lives.
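One reason for the near-perfect correlation may be mechanical: in many CPS extracts, potential experience is defined as age − education − 6 (an assumption here, not stated in this dataset's documentation). When experience is constructed that way and education varies much less than age, the two move almost one-for-one, as a simulated sketch shows.

```r
set.seed(1)
sim_age <- sample(18:64, 500, replace = TRUE)
sim_edu <- sample(8:18, 500, replace = TRUE)
sim_exp <- sim_age - sim_edu - 6   # "potential experience" convention (assumed)

# Education's spread is small relative to age's,
# so age and constructed experience are almost perfectly correlated
cor(sim_age, sim_exp)
```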

What is also interesting is that education is negatively correlated with experience (~-0.35), which might indicate that people with higher education levels do not necessarily have more work experience (this makes sense if we assume that individuals who spent longer in education started working later).

Let’s graphically represent the relationships between wage and education, experience, and age.

We chose to apply the LOESS smoothing method to this plot because it is non-parametric and does not assume a linear relationship between the variables. Instead, LOESS fits a smoothed curve that best describes the relationship, allowing us to explore potential non-linear patterns in the data more closely. This approach is particularly suitable since we are uncertain about the exact shape of the relationship.

Additionally, the dataset’s relatively small size means that individual data points can have a significant impact. Because LOESS is sensitive to outliers and local variation, it lets us see how these features shape the overall trend between wage and education.
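Under the hood, geom_smooth() calls stats::loess() for samples of this size. A self-contained sketch on simulated data shows how the span argument (ggplot2's default is 0.75) controls how locally the curve follows the data:

```r
set.seed(3)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)      # noisy non-linear relationship

fit_smooth <- loess(y ~ x, span = 0.75) # wider window: smoother, more biased curve
fit_local  <- loess(y ~ x, span = 0.30) # narrower window: tracks local structure

# The locally fitted curve recovers the underlying sin(x) reasonably well
mean(abs(predict(fit_local) - sin(x)))
```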

scatterplot_education <- ggplot(CPS, aes(x = education, y = wage)) +
  geom_point(size=1) +
 geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95) +
  labs(title = "Education And Wage Relationship",
       x = "Years of Education",
       y = "Wage") +
  theme_minimal()

ggplotly(scatterplot_education, width = 800, height = 600)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

If we consider the scatterplot of wage and education, we can observe a positive relationship between the two variables, which was previously seen in the correlation matrix. Furthermore, the trend is rather consistent and the variability is not too high, which indicates that the relationship might be quite strong. The consistently increasing slope might signal that there are no diminishing returns to additional years of education, at least in the sectors and occupations observed in this dataset.

scatterplot_experience <- ggplot(CPS, aes(x = experience, y = wage)) +
  geom_point(size=1) +
 geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95) +
  labs(title = "Experience And Wage Relationship",
       x = "Years of Experience",
       y = "Wage") +
  theme_minimal()

ggplotly(scatterplot_experience, width = 800, height = 600)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Compared to education, work experience does not show a clear relationship with wage: the variability is so high that the graph is visually noisy, with no clearly distinguishable trend. This is consistent with the correlation matrix, which showed a weaker correlation between wage and experience than between wage and education.

As for the trend in the fitted line, there is an upward slope from 0 to ~12 years of experience, followed by a flat segment. This might signal that the positive relationship is not constant across the whole career and is stronger in the first 10-15 years of work experience, on average. There is also a notable drop in wage for individuals with more than ~40 years of experience, but the number of observations in this group is relatively low and the variability is so high that no preliminary conclusions can be drawn.

Since age is highly correlated with experience, we can expect similar results for the relationship between wage and age on average, but the exact trends might differ, so it might be useful to plot it as well:

scatterplot_age <- ggplot(CPS, aes(x = age, y = wage)) +
  geom_point(size=1) +
 geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95) +
  labs(title = "Age And Wage Relationship",
       x = "Age",
       y = "Wage") +
  theme_minimal()

ggplotly(scatterplot_age, width = 800, height = 600)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

As with experience, the relationship between wage and age is not exactly linear. The fitted line shows a slight upward slope from 18 to 35 years of age, followed by stagnation. This might indicate that wage gains are concentrated in the first years of a career, after which wages plateau or even decline slightly. It is also consistent with the trend we observed between wage and years of experience.

Distribution of Wages Across Different Social Groups

Let’s have a closer look at the qualitative categories that we have in this dataset.

categorical_vars <- c("ethnicity", "region", "gender", "occupation", "sector", "union", "married")

for (var in categorical_vars) {
  cat("Categories in", var, ":\n")
  print(unique(CPS[[var]]))
  cat("\n")
}
## Categories in ethnicity :
## [1] "hispanic" "cauc"     "other"   
## 
## Categories in region :
## [1] "other" "south"
## 
## Categories in gender :
## [1] "female" "male"  
## 
## Categories in occupation :
## [1] "worker"     "management" "sales"      "office"     "services"  
## [6] "technical" 
## 
## Categories in sector :
## [1] "manufacturing" "other"         "construction" 
## 
## Categories in union :
## [1] "no"  "yes"
## 
## Categories in married :
## [1] "yes" "no"

It might be interesting to observe differences in wage distribution across different social groups separately.

Sectors and Occupations

Let’s start with professional occupations and sectors.

We suspect that there is a significant difference in wages among individuals in different occupations and industries, since some sectors and occupations require more advanced and/or specialized skills than others.

wage_by_sector <- ggplot(CPS, aes(x = sector, y = wage)) +
  geom_boxplot(fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Wages Across Sectors",
       x = "Sector",
       y = "Wage") +
  theme_minimal()

ggplotly(wage_by_sector, width = 800, height = 600)

Medians and interquartile ranges are quite close across sectors. As might be expected, the “other” category (which includes all data points outside construction and manufacturing) has the most numerous and most extreme outliers but the lowest median. The construction sector has the highest median wage and the lowest variability in wages.

summary_sectors <- CPS %>%
  group_by(sector) %>%
  summarise(
    Mean = mean(wage),
    Median = median(wage),
    SD = sd(wage),
    IQR = IQR(wage, na.rm = TRUE),
    Count = n()
  )

summary_sectors
## # A tibble: 3 × 6
##   sector         Mean Median    SD   IQR Count
##   <chr>         <dbl>  <dbl> <dbl> <dbl> <int>
## 1 construction   9.22   9.75  3.36  4.74    24
## 2 manufacturing  9.60   9     4.97  6.27    99
## 3 other          8.87   7.5   5.26  6.06   411

Despite the significant difference in the number of observations across sectors (the “other” category is by far the largest), the sectors appear comparable in terms of mean and median.

The standard deviation is the highest in the “other” category, which is consistent with the boxplot. The construction sector has the lowest standard deviation.

According to the calculated IQRs, manufacturing shows the highest variability in the middle 50% of the data, while construction shows the lowest; both observations are consistent with the boxplot.

Overall, despite rather close measures of central tendency, the measures of variability differ across sectors, which might indicate that sector of employment is an important predictor of wage level.

It is also likely to be a common cause confounder for the relationship between education and wages, since people working in different sectors might differ in both education levels and wages. This might be especially true for the “other” category, since it includes a wide variety of industries.

# install.packages("forcats")
library(forcats)
library(dplyr)

CPS <- CPS %>%
  mutate(occupation = fct_reorder(occupation, wage, .fun = median))

wage_by_occupation <- ggplot(CPS, aes(x = occupation, y = wage)) +
  geom_boxplot(fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Wages Across Occupations",
       x = "Occupation",
       y = "Wage") +
  theme_minimal()

ggplotly(wage_by_occupation, width = 800, height = 600)
summary_occupations <- CPS %>%
  group_by(occupation) %>%
  summarise(
    Mean = mean(wage),
    Median = median(wage),
    SD = sd(wage),
    IQR = IQR(wage, na.rm = TRUE),
    Count = n()
  )

summary_occupations
## # A tibble: 6 × 6
##   occupation  Mean Median    SD   IQR Count
##   <fct>      <dbl>  <dbl> <dbl> <dbl> <int>
## 1 services    6.54   5.5   3.67  4.04    83
## 2 sales       7.59   5.72  4.23  6.52    38
## 3 worker      8.43   7.15  4.25  5.76   156
## 4 office      7.42   7.5   2.70  4.3     97
## 5 technical  11.9   10.6   5.52  7.88   105
## 6 management 12.7   10.6   7.57  9.14    55

The highest medians and interquartile ranges are observed in the management and technical groups, which means they have a higher hourly wage for a “typical” worker as well as larger variability in wages. These occupations often require advanced education and specialized skills, which can explain the higher wages.

The worker and sales occupations have quite long tails with relatively small numbers of very high earners. Successful sales specialists can earn high incomes from commissions, while some highly qualified workers can command much higher hourly wages than average due to advanced or highly specialized skills.

Overall, the differences in variability, mean, and median wages across these groups are quite large, which might indicate that occupation is a very important predictor of wage level. It is also likely a common cause confounder for the relationship between education and wages, since people with higher education levels might be more likely to work in higher-paid sectors and occupations (this holds especially for the management and technical groups, which often require advanced education).

Ethnic groups

Let’s now observe the distribution of wages across different demographic groups.

CPS <- CPS %>%
  mutate(ethnicity = fct_reorder(ethnicity, wage, .fun = median))

wage_by_ethnicity <- ggplot(CPS, aes(x = ethnicity, y = wage)) +
  geom_boxplot(fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Wages Across Ethnic Groups",
       x = "Occupation",
       y = "Wage") +
  theme_minimal()

ggplotly(wage_by_ethnicity, width = 800, height = 600)
summary_ethnicity <- CPS %>%
  group_by(ethnicity) %>%
  summarise(
    Mean = mean(wage),
    Median = median(wage),
    SD = sd(wage),
    IQR = IQR(wage, na.rm = TRUE),
    Count = n()
  )

summary_ethnicity
## # A tibble: 3 × 6
##   ethnicity  Mean Median    SD   IQR Count
##   <fct>     <dbl>  <dbl> <dbl> <dbl> <int>
## 1 hispanic   7.28    5.2  5.52  3.19    27
## 2 other      8.06    7.5  4.08  5.29    67
## 3 cauc       9.28    8    5.23  6.23   440

Caucasians have both the highest median wage and the highest variability in wages.

The higher variability can be partly explained by the fact that the majority of the sample is Caucasian, so the distribution of wages is more spread out. Additionally, there might be structural differences in incomes between ethnic groups. For instance, non-white minorities may have less access to higher-paid sectors and occupations, which can lead to lower wages.

library(RColorBrewer)

ethnicity_colors <- c("hispanic" = "brown4", "other" = "darkgoldenrod", "cauc" = "cadetblue")

ethnicity_occupation_plot <- ggplot(CPS, aes(x = occupation, fill = ethnicity)) +
  geom_bar(position = "fill") +
  labs(title = "Ethnic Composition Across Occupations",
       x = "Occupation",
       y = "Proportion",
       fill = "Ethnicity") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = ethnicity_colors) +
  theme_minimal()


ggplotly(ethnicity_occupation_plot)

As we can see, among occupation groups, non-Caucasians’ proportions are highest in the office, worker, and services groups, which are associated with lower wages, as we saw previously.

Generally speaking, ethnic origin might be a very important predictor of wage level. Additionally, it might also affect access to education, which makes it a common cause confounder.

Gender

Let’s observe the distribution of wages across gender. Gender disparity is widely discussed in the economic literature, so it is interesting to see whether there is a wage gap between the two genders in this dataset. We would expect a significant difference, especially since the data come from the 1980s, when the gender pay gap was even more pronounced.

wage_by_gender <- ggplot(CPS, aes(x = gender, y = wage)) +
  geom_boxplot(fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Wages Across Gender Groups",
       x = "Gender",
       y = "Wage") +
  theme_minimal()

ggplotly(wage_by_gender, width = 800, height = 600)

Even though both the median and the interquartile range are higher for males, there are quite a lot of outliers in the female group, indicating a small number of high earners among females.

summary_gender <- CPS %>%
  group_by(gender) %>%
  summarise(
    Mean = mean(wage),
    Median = median(wage),
    SD = sd(wage),
    IQR = IQR(wage, na.rm = TRUE),
    Count = n()
  )

summary_gender
## # A tibble: 2 × 6
##   gender  Mean Median    SD   IQR Count
##   <chr>  <dbl>  <dbl> <dbl> <dbl> <int>
## 1 female  7.88   6.8   4.72  5.25   245
## 2 male    9.99   8.93  5.29  7      289

Despite the outliers mentioned earlier, the standard deviation and interquartile range of the male group signal higher variability in wages.

Again, we would expect gender to be a common cause confounder for the relationship between education and wages, because it can affect both access to education and the number of years of formal schooling (due to social norms and expectations applied to women), as well as the wage level (due to glass ceilings).
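The observed gap could be checked formally with a Welch two-sample t-test; on the real data the call would be t.test(wage ~ gender, data = CPS). Below is a self-contained sketch on simulated log-normal wages (the group sizes and rough means mirror the summary table above, but the generated numbers are illustrative only):

```r
set.seed(42)
# Simulated right-skewed wages, roughly matching the observed group means
wage_female <- rlnorm(245, meanlog = log(7), sdlog = 0.5)
wage_male   <- rlnorm(289, meanlog = log(9), sdlog = 0.5)

tt <- t.test(wage_male, wage_female)  # Welch test: unequal variances by default
tt$estimate                           # the two sample means
tt$p.value                            # small value => means differ significantly
```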

Location

The data also allow us to observe regional differences in wages. The dataset includes two location categories: south and other. We would expect a wage gap between the two, since the South is generally considered more rural and less developed, resulting in lower average wages.

wage_by_region <- ggplot(CPS, aes(x = region, y = wage)) +
  geom_boxplot(fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Wages Across Regions",
       x = "Region",
       y = "Wage") +
  theme_minimal()

ggplotly(wage_by_region, width = 800, height = 600)

Both the median and the range are clearly higher in the “other” category, which is consistent with the assumption that the South has lower wages.

summary_region <- CPS %>%
  group_by(region) %>%
  summarise(
    Mean = mean(wage),
    Median = median(wage),
    SD = sd(wage),
    IQR = IQR(wage, na.rm = TRUE),
    Count = n()
  )

summary_region
## # A tibble: 2 × 6
##   region  Mean Median    SD   IQR Count
##   <chr>  <dbl>  <dbl> <dbl> <dbl> <int>
## 1 other   9.49   8.60  5.23  6.37   378
## 2 south   7.90   6.25  4.74  5.46   156

Unionization

Unionization is another important factor that can affect wages: union members are protected by collective agreements and usually have higher wages.

wage_by_union <- ggplot(CPS, aes(x = union, y = wage)) +
  geom_boxplot(fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Wages Across Union Memberships",
       x = "Union Membership",
       y = "Wage") +
  theme_minimal()

ggplotly(wage_by_union, width = 800, height = 600)

Causal Analysis

After the preliminary correlation and exploratory analysis, we can now proceed with hypothesis testing. Education was found to have the strongest positive correlation with wages among the quantitative variables (correlations between wages and age and experience were weaker). Qualitative variables, such as occupation and sector, also showed differences in wage distribution, and their relationship with wages might be interesting to explore further.

The set of hypotheses is formulated in the following way: under the null hypothesis, education has no effect on wages once confounders are accounted for; under the alternative, additional years of education increase wages.

Let’s have a look at the relationship between the level of education, controlling for experience and age, and wages.

First, since age and experience are highly correlated, we will check for multicollinearity between these two variables using the Variance Inflation Factor (VIF). Let’s create models with and without the collinear independent variables and compare the results:

# install.packages("stargazer")
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
model_numerical_1 <- lm(wage ~ education + experience + age, data = CPS)
model_numerical_2 <- lm(wage ~ education + experience, data = CPS)
model_numerical_3 <- lm(wage ~ education + age, data = CPS)

stargazer(model_numerical_1, model_numerical_2, model_numerical_3,
          type = "text",
          title = "Wage Vs. Education, Experience & Age",
          align = TRUE,
          column.labels = c("All numerical", "w/o Age", "w/o Experience"))
## 
## Wage Vs. Education, Experience & Age
## ===========================================================================================
##                                               Dependent variable:                          
##                     -----------------------------------------------------------------------
##                                                      wage                                  
##                          All numerical              w/o Age             w/o Experience     
##                               (1)                     (2)                     (3)          
## -------------------------------------------------------------------------------------------
## education                    0.948                 0.926***                0.821***        
##                             (1.155)                 (0.081)                 (0.077)        
##                                                                                            
## experience                   0.128                 0.105***                                
##                             (1.156)                 (0.017)                                
##                                                                                            
## age                         -0.022                                         0.105***        
##                             (1.155)                                         (0.017)        
##                                                                                            
## Constant                    -4.770                 -4.904***               -5.534***       
##                             (7.043)                 (1.219)                 (1.279)        
##                                                                                            
## -------------------------------------------------------------------------------------------
## Observations                  534                     534                     534          
## R2                           0.202                   0.202                   0.202         
## Adjusted R2                  0.198                   0.199                   0.199         
## Residual Std. Error    4.604 (df = 530)        4.599 (df = 531)        4.599 (df = 531)    
## F Statistic         44.727*** (df = 3; 530) 67.217*** (df = 2; 531) 67.210*** (df = 2; 531)
## ===========================================================================================
## Note:                                                           *p<0.1; **p<0.05; ***p<0.01

The first model includes all three variables, the second only education and experience (excluding age), and the third only education and age (excluding experience).

The first model demonstrates highly inflated standard errors, most likely due to multicollinearity between experience and age.

Both the 2nd and 3rd models show a statistically significant and fairly strong association between education and wage, while the coefficients for the isolated experience and age variables are nearly identical. This is indeed a sign of multicollinearity.

Let’s see the Variance Inflation Factor to check the power of multicollinearity in the 1st model:

# install.packages("car")
library(car)
## Warning: package 'car' was built under R version 4.4.1
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.1
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif1 <- vif(model_numerical_1)
print(vif1)
##  education experience        age 
##   229.5738  5147.9190  4611.4008

The calculated VIF is extremely high, which indicates that multicollinearity is a significant issue in the model.

Let’s check VIF for the 2nd and 3rd models:

vif2 <- vif(model_numerical_2)
print(vif2)
##  education experience 
##   1.142049   1.142049
vif3 <- vif(model_numerical_3)
print(vif3)
## education       age 
##  1.023024  1.023024

Both models have VIF values ~1, which is a sign of no multicollinearity.
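For intuition, the VIF of predictor j equals 1 / (1 − R_j²), where R_j² comes from regressing x_j on the remaining predictors. A self-contained sketch with two nearly collinear simulated predictors reproduces the pattern seen in the first model:

```r
set.seed(7)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)  # nearly collinear with x1 (like age vs experience)
x3 <- rnorm(200)                 # unrelated predictor

# R^2 from regressing x1 on the other predictors
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)        # explodes as R^2 approaches 1
vif_x1
```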

Let’s now observe the associations between wage and the socio-economic characteristics of the individuals.

model_categorical <- lm(wage ~ ethnicity + region + gender + occupation + sector 
+ union + married, data = CPS)

stargazer(model_categorical, type = "text", title = "Wage Vs. Categorical variables", align = TRUE)
## 
## Wage Vs. Categorical variables
## ================================================
##                          Dependent variable:    
##                      ---------------------------
##                                 wage            
## ------------------------------------------------
## ethnicityother                  0.771           
##                                (1.023)          
##                                                 
## ethnicitycauc                  1.623*           
##                                (0.893)          
##                                                 
## regionsouth                    -0.824*          
##                                (0.436)          
##                                                 
## gendermale                    2.028***          
##                                (0.434)          
##                                                 
## occupationsales                 0.761           
##                                (0.893)          
##                                                 
## occupationworker                0.188           
##                                (0.704)          
##                                                 
## occupationoffice               1.388**          
##                                (0.678)          
##                                                 
## occupationtechnical           4.752***          
##                                (0.669)          
##                                                 
## occupationmanagement          5.691***          
##                                (0.794)          
##                                                 
## sectormanufacturing             0.714           
##                                (1.033)          
##                                                 
## sectorother                    -0.564           
##                                (1.003)          
##                                                 
## unionyes                      2.018***          
##                                (0.531)          
##                                                 
## marriedyes                      0.608           
##                                (0.413)          
##                                                 
## Constant                      4.392***          
##                                (1.480)          
##                                                 
## ------------------------------------------------
## Observations                     534            
## R2                              0.261           
## Adjusted R2                     0.243           
## Residual Std. Error       4.471 (df = 520)      
## F Statistic           14.162*** (df = 13; 520)  
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

Since this is not a complete model of the relationship we are interested in, we cannot draw conclusions about the effect of these variables on wages. However, it is valuable to make preliminary observations on how these common-cause confounders correlate with wages.

As we can see, this check is consistent with our prior assumptions about the confounding variables. So far, occupation is the strongest predictor of wage level among the confounders. This may also indicate that education indeed translates into higher wages, because technical and managerial positions usually require a higher level of education. At the same time, there is some association with an individual's social position: ethnicity, location, gender, and union membership are also associated with wage level.

Models with confounders included

Let’s run the models with the confounders included. We do not include age because of its high collinearity with experience.
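The collinearity between age and experience can be verified directly before dropping age. A minimal sketch, assuming the `age`, `education`, and `experience` columns as loaded above (in the CPS 1985 data, experience is conventionally constructed as age minus years of education minus 6, which would make age an almost exact linear combination of the other two regressors):

```r
# Correlation between age and experience: a value near 1 signals
# severe collinearity if both enter the same model.
cor(CPS$age, CPS$experience)

# Check the presumed construction experience = age - education - 6;
# TRUE (or near-TRUE) confirms age is redundant given the other two.
all.equal(CPS$age, CPS$experience + CPS$education + 6)
```

If the identity holds, including age alongside education and experience would make the design matrix (nearly) rank-deficient, so excluding it is the right call.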

# num + gender
model_full1 <- lm(wage ~ education + experience + gender, data = CPS)
# num + ethnicity 
model_full2 <- lm(wage ~ education + experience + ethnicity, data = CPS)
# num + gender + ethnicity 
model_full3 <- lm(wage ~ education + experience + gender + ethnicity, data = CPS)

stargazer(model_full1, model_full2, model_full3, type = "text", title = "Wage Level predictors", align = TRUE)
## 
## Wage Level predictors
## ===========================================================================================
##                                               Dependent variable:                          
##                     -----------------------------------------------------------------------
##                                                      wage                                  
##                               (1)                     (2)                     (3)          
## -------------------------------------------------------------------------------------------
## education                  0.941***                0.915***                0.930***        
##                             (0.079)                 (0.082)                 (0.080)        
##                                                                                            
## experience                 0.113***                0.105***                0.113***        
##                             (0.017)                 (0.017)                 (0.017)        
##                                                                                            
## gendermale                 2.338***                                        2.356***        
##                             (0.388)                                         (0.388)        
##                                                                                            
## ethnicityother                                      -0.441                  -0.622         
##                                                     (1.054)                 (1.020)        
##                                                                                            
## ethnicitycauc                                        0.409                   0.336         
##                                                     (0.923)                 (0.893)        
##                                                                                            
## Constant                   -6.505***               -5.041***               -6.576***       
##                             (1.210)                 (1.403)                 (1.382)        
##                                                                                            
## -------------------------------------------------------------------------------------------
## Observations                  534                     534                     534          
## R2                           0.253                   0.205                   0.257         
## Adjusted R2                  0.249                   0.199                   0.250         
## Residual Std. Error    4.454 (df = 530)        4.599 (df = 529)        4.451 (df = 528)    
## F Statistic         59.885*** (df = 3; 530) 34.131*** (df = 4; 529) 36.527*** (df = 5; 528)
## ===========================================================================================
## Note:                                                           *p<0.1; **p<0.05; ***p<0.01

In these models we have added the social confounders hypothesized to be important: gender and ethnicity.

Now, let’s examine the model with a categorical variable for occupation included. We expect that this variable will be a significant predictor of wage level, since it is a common cause confounder for the relationship between education and wages.

model_full4 <- lm(wage ~ education + experience + gender + occupation, data = CPS)

stargazer(model_full1, model_full4, type = "text", title = "Wage Level predictors", align = TRUE)
## 
## Wage Level predictors
## ====================================================================
##                                    Dependent variable:              
##                      -----------------------------------------------
##                                           wage                      
##                                (1)                     (2)          
## --------------------------------------------------------------------
## education                   0.941***                0.717***        
##                              (0.079)                 (0.099)        
##                                                                     
## experience                  0.113***                0.103***        
##                              (0.017)                 (0.017)        
##                                                                     
## gendermale                  2.338***                1.965***        
##                              (0.388)                 (0.416)        
##                                                                     
## occupationsales                                      -0.199         
##                                                      (0.862)        
##                                                                     
## occupationworker                                     1.427**        
##                                                      (0.612)        
##                                                                     
## occupationoffice                                      0.577         
##                                                      (0.665)        
##                                                                     
## occupationtechnical                                 2.818***        
##                                                      (0.738)        
##                                                                     
## occupationmanagement                                3.840***        
##                                                      (0.806)        
##                                                                     
## Constant                    -6.505***               -4.662***       
##                              (1.210)                 (1.397)        
##                                                                     
## --------------------------------------------------------------------
## Observations                   534                     534          
## R2                            0.253                   0.302         
## Adjusted R2                   0.249                   0.291         
## Residual Std. Error     4.454 (df = 530)        4.327 (df = 525)    
## F Statistic          59.885*** (df = 3; 530) 28.351*** (df = 8; 525)
## ====================================================================
## Note:                                    *p<0.1; **p<0.05; ***p<0.01

In the last model, the “services” occupation is the reference category. The coefficients for the other categories are interpreted as the difference in wage level relative to “services”. Since “services” had the lowest median and interquartile range, it is a good choice of reference category as the group with the lowest earnings. As expected, technical and managerial positions are associated with higher wages, increasing the wage by roughly 2.8 and 3.8 dollars on average, respectively. So far, this association is stronger than those of education, experience, and gender.
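If R were to pick a different baseline (by default it uses the first factor level alphabetically), the reference category can be set explicitly. A sketch, assuming `occupation` contains a level literally named `"services"`:

```r
# Make "services" the reference level so every other occupation dummy
# is interpreted as the wage difference relative to service jobs.
CPS$occupation <- relevel(factor(CPS$occupation), ref = "services")
```

This only changes which dummy is omitted; fitted values and overall model fit are unaffected.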

The model with occupation included shows a jump in R-squared (from 0.253 to 0.302) and a notable change in the coefficients for education and gender: the magnitude of both coefficients decreased after controlling for occupation.

There is a chance of collinearity between occupation and education, which should be checked:

vif4 <- vif(model_full4)
print(vif4)
##                GVIF Df GVIF^(1/(2*Df))
## education  1.910504  1        1.382210
## experience 1.188824  1        1.090332
## gender     1.227183  1        1.107783
## occupation 2.056254  5        1.074751

Occupation shows only low multicollinearity, which is acceptable.

# install.packages("lmtest")
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.4.1
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.1
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(model_full1)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_full1
## BP = 9.0939, df = 3, p-value = 0.02807
bptest(model_full4)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_full4
## BP = 22.39, df = 8, p-value = 0.004242

The Breusch-Pagan test indicates heteroscedasticity in both models: the error variance is not constant across fitted values, so the usual OLS standard errors may be unreliable. This can also be a sign of omitted variables.
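One common remedy is to keep the OLS point estimates but report heteroscedasticity-robust standard errors. A sketch, assuming the `sandwich` package is installed alongside `lmtest`:

```r
library(sandwich)

# Re-test the coefficients of model_full4 using an HC1 robust
# covariance matrix: point estimates are unchanged, only the
# standard errors (and hence t-statistics) are adjusted.
coeftest(model_full4, vcov = vcovHC(model_full4, type = "HC1"))
```

If the significant coefficients stay significant under robust standard errors, the substantive conclusions are not driven by the heteroscedasticity.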

plot(model_full4$fitted.values, model_full4$residuals,
     main = "Residuals vs Fitted: Model with Gender & Occupation",
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, col = "red")

Let’s now add new confounders to the model 4.

model_full5 <- lm(wage ~ education + experience + gender + occupation + sector, data = CPS)
stargazer(model_full4 ,model_full5, type = "text", title = "Wage Level predictors", align = TRUE)
## 
## Wage Level predictors
## =====================================================================
##                                    Dependent variable:               
##                      ------------------------------------------------
##                                            wage                      
##                                (1)                     (2)           
## ---------------------------------------------------------------------
## education                   0.717***                 0.710***        
##                              (0.099)                 (0.099)         
##                                                                      
## experience                  0.103***                 0.100***        
##                              (0.017)                 (0.017)         
##                                                                      
## gendermale                  1.965***                 2.025***        
##                              (0.416)                 (0.419)         
##                                                                      
## occupationsales              -0.199                   -0.284         
##                              (0.862)                 (0.862)         
##                                                                      
## occupationworker             1.427**                  0.907          
##                              (0.612)                 (0.679)         
##                                                                      
## occupationoffice              0.577                   0.529          
##                              (0.665)                 (0.665)         
##                                                                      
## occupationtechnical         2.818***                 2.723***        
##                              (0.738)                 (0.740)         
##                                                                      
## occupationmanagement        3.840***                 3.754***        
##                              (0.806)                 (0.807)         
##                                                                      
## sectormanufacturing                                   0.479          
##                                                      (0.996)         
##                                                                      
## sectorother                                           -0.539         
##                                                      (0.973)         
##                                                                      
## Constant                    -4.662***                -4.029**        
##                              (1.397)                 (1.712)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                   534                     534           
## R2                            0.302                   0.306          
## Adjusted R2                   0.291                   0.293          
## Residual Std. Error     4.327 (df = 525)         4.321 (df = 523)    
## F Statistic          28.351*** (df = 8; 525) 23.085*** (df = 10; 523)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

Adding sector as a predictor does not seem to produce any significant changes in the model. The coefficients for other predictors remain quite similar, while sector dummy variables are not significant.

model_full6 <- lm(wage ~ education + experience + gender + occupation + region, data = CPS)
stargazer(model_full4, model_full6, type = "text", title = "Wage Level predictors", align = TRUE)
## 
## Wage Level predictors
## ====================================================================
##                                    Dependent variable:              
##                      -----------------------------------------------
##                                           wage                      
##                                (1)                     (2)          
## --------------------------------------------------------------------
## education                   0.717***                0.694***        
##                              (0.099)                 (0.100)        
##                                                                     
## experience                  0.103***                0.101***        
##                              (0.017)                 (0.016)        
##                                                                     
## gendermale                  1.965***                1.990***        
##                              (0.416)                 (0.415)        
##                                                                     
## occupationsales              -0.199                  -0.137         
##                              (0.862)                 (0.861)        
##                                                                     
## occupationworker             1.427**                 1.430**        
##                              (0.612)                 (0.611)        
##                                                                     
## occupationoffice              0.577                   0.638         
##                              (0.665)                 (0.664)        
##                                                                     
## occupationtechnical         2.818***                2.825***        
##                              (0.738)                 (0.737)        
##                                                                     
## occupationmanagement        3.840***                3.832***        
##                              (0.806)                 (0.804)        
##                                                                     
## regionsouth                                          -0.791*        
##                                                      (0.417)        
##                                                                     
## Constant                    -4.662***               -4.139***       
##                              (1.397)                 (1.421)        
##                                                                     
## --------------------------------------------------------------------
## Observations                   534                     534          
## R2                            0.302                   0.306         
## Adjusted R2                   0.291                   0.295         
## Residual Std. Error     4.327 (df = 525)        4.316 (df = 524)    
## F Statistic          28.351*** (df = 8; 525) 25.726*** (df = 9; 524)
## ====================================================================
## Note:                                    *p<0.1; **p<0.05; ***p<0.01

Adding region (which indicates living in the North or the South of the US) as a predictor shows that living in the South is associated with lower wages. It has also slightly altered the magnitudes of the other predictors, but their coefficients remain significant.

model_full7 <- lm(wage ~ education + experience + gender + occupation + region + union, data = CPS)
stargazer(model_full6, model_full7, type = "text", title = "Wage Level predictors", align = TRUE)
## 
## Wage Level predictors
## =====================================================================
##                                    Dependent variable:               
##                      ------------------------------------------------
##                                            wage                      
##                                (1)                     (2)           
## ---------------------------------------------------------------------
## education                   0.694***                 0.672***        
##                              (0.100)                 (0.099)         
##                                                                      
## experience                  0.101***                 0.094***        
##                              (0.016)                 (0.017)         
##                                                                      
## gendermale                  1.990***                 1.845***        
##                              (0.415)                 (0.415)         
##                                                                      
## occupationsales              -0.137                   0.173          
##                              (0.861)                 (0.860)         
##                                                                      
## occupationworker             1.430**                 1.349**         
##                              (0.611)                 (0.607)         
##                                                                      
## occupationoffice              0.638                   0.801          
##                              (0.664)                 (0.661)         
##                                                                      
## occupationtechnical         2.825***                 2.880***        
##                              (0.737)                 (0.731)         
##                                                                      
## occupationmanagement        3.832***                 4.148***        
##                              (0.804)                 (0.805)         
##                                                                      
## regionsouth                  -0.791*                 -0.689*         
##                              (0.417)                 (0.415)         
##                                                                      
## unionyes                                             1.517***        
##                                                      (0.508)         
##                                                                      
## Constant                    -4.139***               -4.014***        
##                              (1.421)                 (1.411)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                   534                     534           
## R2                            0.306                   0.318          
## Adjusted R2                   0.295                   0.305          
## Residual Std. Error     4.316 (df = 524)         4.284 (df = 523)    
## F Statistic          25.726*** (df = 9; 524) 24.394*** (df = 10; 523)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

Being a union member appears to have a significant positive relationship with wage level (as was theorized before constructing the model). Adding the predictor for union membership has also slightly altered the magnitudes of the other predictors, but their coefficients remain significant.

Interestingly, the coefficients affected most belong to occupation: the coefficient for managerial positions has increased.

model_full8 <- lm(wage ~ education + experience + gender + occupation + region + union + married, data = CPS)
stargazer(model_full7, model_full8, type = "text", title = "Wage Level predictors", align = TRUE)
## 
## Wage Level predictors
## ======================================================================
##                                     Dependent variable:               
##                      -------------------------------------------------
##                                            wage                       
##                                (1)                      (2)           
## ----------------------------------------------------------------------
## education                    0.672***                 0.672***        
##                              (0.099)                  (0.099)         
##                                                                       
## experience                   0.094***                 0.090***        
##                              (0.017)                  (0.017)         
##                                                                       
## gendermale                   1.845***                 1.852***        
##                              (0.415)                  (0.415)         
##                                                                       
## occupationsales               0.173                    0.099          
##                              (0.860)                  (0.865)         
##                                                                       
## occupationworker             1.349**                  1.316**         
##                              (0.607)                  (0.608)         
##                                                                       
## occupationoffice              0.801                    0.780          
##                              (0.661)                  (0.662)         
##                                                                       
## occupationtechnical          2.880***                 2.822***        
##                              (0.731)                  (0.735)         
##                                                                       
## occupationmanagement         4.148***                 4.098***        
##                              (0.805)                  (0.808)         
##                                                                       
## regionsouth                  -0.689*                  -0.697*         
##                              (0.415)                  (0.415)         
##                                                                       
## unionyes                     1.517***                 1.487***        
##                              (0.508)                  (0.510)         
##                                                                       
## marriedyes                                             0.338          
##                                                       (0.410)         
##                                                                       
## Constant                    -4.014***                -4.124***        
##                              (1.411)                  (1.418)         
##                                                                       
## ----------------------------------------------------------------------
## Observations                   534                      534           
## R2                            0.318                    0.319          
## Adjusted R2                   0.305                    0.305          
## Residual Std. Error      4.284 (df = 523)         4.286 (df = 522)    
## F Statistic          24.394*** (df = 10; 523) 22.224*** (df = 11; 522)
## ======================================================================
## Note:                                      *p<0.1; **p<0.05; ***p<0.01

Being married does not appear to be a significant predictor of wage level in this model, which is consistent with our preliminary assumptions.
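The same conclusion can be reached formally with a nested-model F-test comparing the specifications with and without `married`:

```r
# H0: the coefficient on married is zero. A large p-value means that
# adding married does not significantly improve the fit over model_full7.
anova(model_full7, model_full8)
```

Because model_full8 differs from model_full7 by a single regressor, this F-test is equivalent to the t-test on `marriedyes` in the table above.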

Let’s now compare the model with statistically significant predictors and confounders and the model with all confounders included:

model_full9 <- lm(wage ~ education + experience + gender + occupation + region + union + ethnicity + married + sector, data = CPS)
stargazer(model_full7, model_full9, type = "text", title = "Wage Level predictors", align = TRUE)
## 
## Wage Level predictors
## ======================================================================
##                                     Dependent variable:               
##                      -------------------------------------------------
##                                            wage                       
##                                (1)                      (2)           
## ----------------------------------------------------------------------
## education                    0.672***                 0.655***        
##                              (0.099)                  (0.101)         
##                                                                       
## experience                   0.094***                 0.087***        
##                              (0.017)                  (0.017)         
##                                                                       
## gendermale                   1.845***                 1.939***        
##                              (0.415)                  (0.418)         
##                                                                       
## occupationsales               0.173                    -0.087         
##                              (0.860)                  (0.868)         
##                                                                       
## occupationworker             1.349**                   0.687          
##                              (0.607)                  (0.678)         
##                                                                       
## occupationoffice              0.801                    0.707          
##                              (0.661)                  (0.663)         
##                                                                       
## occupationtechnical          2.880***                 2.649***        
##                              (0.731)                  (0.742)         
##                                                                       
## occupationmanagement         4.148***                 3.977***        
##                              (0.805)                  (0.810)         
##                                                                       
## regionsouth                  -0.689*                   -0.564         
##                              (0.415)                  (0.419)         
##                                                                       
## unionyes                     1.517***                 1.601***        
##                              (0.508)                  (0.512)         
##                                                                       
## ethnicityother                                         -0.230         
##                                                       (0.991)         
##                                                                       
## ethnicitycauc                                          0.608          
##                                                       (0.869)         
##                                                                       
## marriedyes                                             0.297          
##                                                       (0.410)         
##                                                                       
## sectormanufacturing                                    0.562          
##                                                       (0.991)         
##                                                                       
## sectorother                                            -0.478         
##                                                       (0.965)         
##                                                                       
## Constant                    -4.014***                 -3.872**        
##                              (1.411)                  (1.840)         
##                                                                       
## ----------------------------------------------------------------------
## Observations                   534                      534           
## R2                            0.318                    0.326          
## Adjusted R2                   0.305                    0.307          
## Residual Std. Error      4.284 (df = 523)         4.278 (df = 518)    
## F Statistic          24.394*** (df = 10; 523) 16.738*** (df = 15; 518)
## ======================================================================
## Note:                                      *p<0.1; **p<0.05; ***p<0.01

It seems that controlling for ethnicity, sector, and marital status does not affect the model: the coefficients of the other predictors remain quite similar, the new predictors are not significant, and both the adjusted R-squared and the residual standard error are nearly identical across the two models.
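Information criteria can confirm that the extra confounders are not worth their added complexity; a quick check:

```r
# Lower AIC/BIC indicates the preferred model. Given the tiny gain in
# fit from the five extra coefficients, the penalized criteria should
# favor the more parsimonious model_full7.
AIC(model_full7, model_full9)
BIC(model_full7, model_full9)
```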

Model Diagnostics

library(car)
print(vif(model_full7))
##                GVIF Df GVIF^(1/(2*Df))
## education  1.948499  1        1.395887
## experience 1.220804  1        1.104900
## gender     1.245543  1        1.116039
## occupation 2.177498  5        1.080926
## region     1.036379  1        1.018027
## union      1.108663  1        1.052931

VIF values are quite low, indicating that multicollinearity is not an issue in this model after excluding age.

plot(model_full7$fitted.values, residuals(model_full7), 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "red", lty = 2)

The residuals vs. fitted values plot shows no clear non-linear pattern, which suggests that the linear specification is adequate for the included variables.

However, the scatter widens as the fitted values increase, which suggests that the homoscedasticity assumption may not be met.
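The visual impression can be backed by a formal check. As a sketch (assuming the `lmtest` package is installed), the Breusch-Pagan test evaluates the null hypothesis of constant error variance:

```r
# Sketch: formal heteroscedasticity check (assumes the lmtest package).
# The Breusch-Pagan test regresses squared residuals on the predictors;
# a small p-value indicates non-constant error variance.
library(lmtest)
bptest(model_full7)
```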

When we previously plotted the distribution of wage, it was right-skewed, which might have caused the residuals to be heteroscedastic. Let’s apply the log transformation to the dependent variable and see if it improves the model fit.

CPS$log_wage <- log(CPS$wage)
log_wage_plot <- ggplot(CPS, aes(x = log_wage)) +
  geom_density(fill = "skyblue", color = "black") +
  labs(title = "Density Plot of Log-Transformed Wage",
       x = "Log of Wage",
       y = "Density") +
  theme_minimal()

ggplotly(log_wage_plot, width = 800, height = 600)

The distribution of the log-transformed wage is more symmetric and closer to normal, which may help address the heteroscedasticity.

model_full7_log <- lm(log_wage ~ education + experience + gender + occupation + region + union, data = CPS)
stargazer(model_full7, model_full7_log, type = "text", title = "Wage Level predictors", align = TRUE)
## 
## Wage Level predictors
## ===========================================================
##                                    Dependent variable:     
##                                ----------------------------
##                                     wage        log_wage   
##                                     (1)            (2)     
## -----------------------------------------------------------
## education                         0.672***      0.069***   
##                                   (0.099)        (0.010)   
##                                                            
## experience                        0.094***      0.011***   
##                                   (0.017)        (0.002)   
##                                                            
## gendermale                        1.845***      0.208***   
##                                   (0.415)        (0.042)   
##                                                            
## occupationsales                    0.173          0.054    
##                                   (0.860)        (0.086)   
##                                                            
## occupationworker                  1.349**       0.200***   
##                                   (0.607)        (0.061)   
##                                                            
## occupationoffice                   0.801        0.187***   
##                                   (0.661)        (0.066)   
##                                                            
## occupationtechnical               2.880***      0.361***   
##                                   (0.731)        (0.073)   
##                                                            
## occupationmanagement              4.148***      0.404***   
##                                   (0.805)        (0.081)   
##                                                            
## regionsouth                       -0.689*       -0.106**   
##                                   (0.415)        (0.042)   
##                                                            
## unionyes                          1.517***      0.206***   
##                                   (0.508)        (0.051)   
##                                                            
## Constant                         -4.014***      0.639***   
##                                   (1.411)        (0.142)   
##                                                            
## -----------------------------------------------------------
## Observations                        534            534     
## R2                                 0.318          0.348    
## Adjusted R2                        0.305          0.336    
## Residual Std. Error (df = 523)     4.284          0.430    
## F Statistic (df = 10; 523)       24.394***      27.973***  
## ===========================================================
## Note:                           *p<0.1; **p<0.05; ***p<0.01

The residual standard error of the log model is much lower, but the two values are not directly comparable, because the responses are on different scales (dollars vs. log dollars).
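Since the responses are on different scales, one hedged way to compare fit on the original wage scale is to back-transform the log model's fitted values (ignoring, for simplicity, the smearing correction for retransformation bias):

```r
# Sketch: rough comparison of fit on the dollar scale.
# exp(fitted) understates E[wage | X] under log-normal errors,
# so this is only an approximate comparison.
rmse_level <- sqrt(mean(residuals(model_full7)^2))
rmse_log   <- sqrt(mean((CPS$wage - exp(fitted(model_full7_log)))^2))
c(level = rmse_level, log = rmse_log)
```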

plot(model_full7_log$fitted.values, residuals(model_full7_log), 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "red", lty = 2)

The visual check of the transformed model shows residuals scattered evenly around the zero line with roughly constant spread, which supports the homoscedasticity assumption.

qqnorm(residuals(model_full7_log))
qqline(residuals(model_full7_log))

The Q-Q plot shows residuals lying close to the reference line, which suggests that the normality assumption is reasonably satisfied.
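The visual impression can be supplemented with a formal test; as a sketch, the Shapiro-Wilk test (base R) checks the null hypothesis that the residuals are normally distributed:

```r
# Sketch: formal normality check of the residuals.
# A small p-value would indicate departure from normality; note that
# with n = 534, even minor deviations can be flagged as significant.
shapiro.test(residuals(model_full7_log))
```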

par(mfrow=c(2,2))
plot(model_full7_log)

print(vif(model_full7_log))
##                GVIF Df GVIF^(1/(2*Df))
## education  1.948499  1        1.395887
## experience 1.220804  1        1.104900
## gender     1.245543  1        1.116039
## occupation 2.177498  5        1.080926
## region     1.036379  1        1.018027
## union      1.108663  1        1.052931

As with the non-log-transformed model, there are no signs of multicollinearity once the age variable is omitted. The VIF values are identical because transforming the response does not change the predictors.

Let’s observe the influential observations in the model.

plot(cooks.distance(model_full7_log), type = "h", pch = 20, main = "Cook's Distance Plot")
abline(h = 4/length(model_full7_log$residuals), col = "red")

The plotted Cook’s distance shows a number of influential observations. Let’s see which observations are the most influential.

cooks_d <- cooks.distance(model_full7_log)
influential_points <- which(cooks_d > 4/length(cooks_d))
print(influential_points)
##  15  76 105 108 166 168 171 179 200 220 231 237 247 277 339 354 356 379 410 414 
##  15  76 105 108 166 168 171 179 200 220 231 237 247 277 339 354 356 379 410 414 
## 416 469 476 478 482 495 533 
## 416 469 476 478 482 495 533
influential_data <- CPS[influential_points, ]
print(influential_data)
##       X  wage education experience age ethnicity region gender occupation
## 15   14 19.98         9         29  44      cauc  south   male     worker
## 76   75  3.00         6         43  55  hispanic  other female     worker
## 105 104 18.50        14         13  33      cauc  other female     worker
## 108 107 14.00         5         44  55      cauc  south   male     worker
## 166 165  3.35        16          9  31      cauc  other   male management
## 168 167  4.84        15          1  22      cauc  other   male management
## 171 170 44.50        14          1  21      cauc  other female management
## 179 178  3.64        12         42  60      cauc  other female management
## 200 199  1.00        12         24  42      cauc  other   male management
## 220 219  3.35        16         14  36     other  south female      sales
## 231 230 19.98        14         44  64      cauc  south   male      sales
## 237 236 14.29        14         32  52      cauc  other female      sales
## 247 246  3.50        12         34  52      cauc  other   male      sales
## 277 276  3.35        16          3  25     other  other   male     office
## 339 338  3.50        12         35  53      cauc  other   male     office
## 354 353 13.10        10         38  54     other  south female   services
## 356 355  3.50         9         48  63      cauc  other   male   services
## 379 378  1.75        12          5  23      cauc  south female   services
## 410 409 25.00        14          4  24  hispanic  other   male   services
## 414 413  3.50         9         47  62      cauc  other   male   services
## 416 415  2.01        13          0  19      cauc  south   male   services
## 469 468  8.00        18         33  57      cauc  other   male  technical
## 476 475  5.50        16         29  51      cauc  south   male  technical
## 478 477  6.25        18         27  51     other  other   male  technical
## 482 481  7.00        18         33  57      cauc  other   male  technical
## 495 494 24.98        16          5  27      cauc  south female  technical
## 533 532 19.88        12         13  31      cauc  south   male  technical
##            sector union married  log_wage
## 15          other    no     yes 2.9947318
## 76  manufacturing   yes     yes 1.0986123
## 105 manufacturing    no      no 2.9177707
## 108  construction    no     yes 2.6390573
## 166         other    no      no 1.2089603
## 168         other   yes     yes 1.5769147
## 171         other    no      no 3.7954892
## 179         other    no     yes 1.2919837
## 200         other    no     yes 0.0000000
## 220         other    no     yes 1.2089603
## 231         other    no     yes 2.9947318
## 237         other    no     yes 2.6595600
## 247         other    no     yes 1.2527630
## 277         other    no      no 1.2089603
## 339         other    no     yes 1.2527630
## 354         other    no     yes 2.5726122
## 356         other    no      no 1.2527630
## 379         other    no     yes 0.5596158
## 410         other   yes      no 3.2188758
## 414         other   yes     yes 1.2527630
## 416         other    no      no 0.6981347
## 469         other   yes      no 2.0794415
## 476         other    no     yes 1.7047481
## 478         other    no     yes 1.8325815
## 482         other    no     yes 1.9459101
## 495         other    no      no 3.2180755
## 533         other   yes     yes 2.9897142

As we can see, these observations have wages far below or above the bulk of the distribution. They are nonetheless perfectly legitimate data points and should not be removed from the dataset. However, it may be valuable to fit a model that is more robust to outliers.

Let’s use a robust linear model (RLM), fit by iteratively reweighted least squares, since its M-estimator is less sensitive to extreme values.

library(MASS)
## Warning: package 'MASS' was built under R version 4.4.1
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
robust_model_full7_log <- rlm(log_wage ~ education + experience + gender + occupation + region + union, data = CPS)
stargazer(model_full7_log, robust_model_full7_log, type = "text", title = "Wage Level predictors", align = TRUE)
## 
## Wage Level predictors
## =================================================================
##                                       Dependent variable:        
##                                ----------------------------------
##                                             log_wage             
##                                          OLS             robust  
##                                                          linear  
##                                          (1)               (2)   
## -----------------------------------------------------------------
## education                              0.069***         0.071*** 
##                                        (0.010)           (0.010) 
##                                                                  
## experience                             0.011***         0.011*** 
##                                        (0.002)           (0.002) 
##                                                                  
## gendermale                             0.208***         0.246*** 
##                                        (0.042)           (0.041) 
##                                                                  
## occupationsales                         0.054             0.056  
##                                        (0.086)           (0.085) 
##                                                                  
## occupationworker                       0.200***         0.182*** 
##                                        (0.061)           (0.060) 
##                                                                  
## occupationoffice                       0.187***         0.208*** 
##                                        (0.066)           (0.065) 
##                                                                  
## occupationtechnical                    0.361***         0.361*** 
##                                        (0.073)           (0.072) 
##                                                                  
## occupationmanagement                   0.404***         0.443*** 
##                                        (0.081)           (0.079) 
##                                                                  
## regionsouth                            -0.106**         -0.120***
##                                        (0.042)           (0.041) 
##                                                                  
## unionyes                               0.206***         0.207*** 
##                                        (0.051)           (0.050) 
##                                                                  
## Constant                               0.639***         0.591*** 
##                                        (0.142)           (0.139) 
##                                                                  
## -----------------------------------------------------------------
## Observations                             534               534   
## R2                                      0.348                    
## Adjusted R2                             0.336                    
## Residual Std. Error (df = 523)          0.430             0.424  
## F Statistic                    27.973*** (df = 10; 523)          
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01

The robust coefficients and standard errors are quite similar to the OLS estimates, which suggests that the influential observations do not materially drive the results.
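One way to see how the robust fit handles the influential observations is to inspect the final IRLS weights stored on the fitted `rlm` object; observations the M-estimator downweights receive weights well below 1 (a sketch using the model fitted above):

```r
# Sketch: the smallest Huber weights mark the most downweighted
# observations; many should coincide with the high-Cook's-distance rows.
w <- robust_model_full7_log$w
head(sort(w), 10)
which(w < 0.5)
```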

To do

Limitations